This document is the summary of the R for Data Analysis workshop.
All correspondence related to this document should be addressed to:
Omid Ghasemi (Macquarie University, Sydney, NSW, 2109, AUSTRALIA)
Email: omidreza.ghasemi@hdr.mq.edu.au
Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations
R can be used as a calculator. When writing mathematical expressions, be careful about the order in which R evaluates the operations, and use parentheses to make that order explicit.
10 + 10
## [1] 20
4 ^ 2
## [1] 16
(250 / 500) * 100
## [1] 50
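As a small illustration of why the order of evaluation matters, the sketch below compares a few expressions with and without parentheses. The variable names are only illustrative and are not used elsewhere in the workshop.

```r
# Multiplication is evaluated before addition
no_parens <- 2 + 3 * 4        # 14
# Parentheses force the addition to happen first
with_parens <- (2 + 3) * 4    # 20
# The power operator is evaluated before the unary minus
neg_power <- -2 ^ 2           # -4
pos_power <- (-2) ^ 2         # 4
print(c(no_parens, with_parens, neg_power, pos_power))
```

When in doubt, adding parentheses costs nothing and makes the intended order obvious to the reader.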
R is flexible with spacing around operators, but spaces are not allowed inside variable names or keywords:
10+10
## [1] 20
10 + 10
## [1] 20
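R can sometimes tell that you’re not finished yet. If a line ends with an operator, the console shows a + prompt and waits for the rest of the expression: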
10 +
How do you create a variable? Values are assigned with <- or =. Note that R is case sensitive everywhere, so pay and Pay would be two different variables:
pay <- 250
month = 12
pay * month
## [1] 3000
salary <- pay * month
A few points on naming variables and vectors: use short, informative words and keep one consistent style (e.g., you can use capital letters, but it is not recommended; separate words only with _ or .).
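The following sketch illustrates these naming points and the case sensitivity mentioned earlier. The names monthly_pay and Monthly_pay are hypothetical and chosen only for the demonstration.

```r
# A short, informative name with words separated by an underscore
monthly_pay <- 250
# Same letters, different case: R treats this as a separate variable
Monthly_pay <- 300
# The two variables hold different values
monthly_pay == Monthly_pay    # FALSE
# A name containing a space is a syntax error, so this line stays commented out:
# monthly pay <- 250
```

Sticking to lowercase names avoids having two variables that differ only by case.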
A function is a set of statements combined to perform a specific task. When we use a block of code repeatedly, we can convert it into a function. To write a function, you first need to define it:
my_multiplier <- function(a, b) {
  result <- a * b
  return(result)
}
Defining the function does not produce any output by itself. To get a result, you need to call it:
my_multiplier(a = 2, b = 4)
## [1] 8
# or: my_multiplier (2, 4)
We can set a default value for our arguments:
my_multiplier2 <- function(a, b = 4) {
  result <- a * b
  return(result)
}
my_multiplier2(a = 2)
## [1] 8
# or: my_multiplier2(2)
# my_multiplier2(2, 6) overrides the default and returns 12
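As a further sketch of turning repeated code into a function, the percentage calculation from the calculator examples could be wrapped as follows. The function to_percent() and its argument names are hypothetical and not part of the workshop material.

```r
# Hypothetical helper: what percentage of `total` is `part`?
to_percent <- function(part, total = 100) {
  result <- (part / total) * 100
  return(result)
}

to_percent(250, 500)                   # 50
# Named arguments can be given in any order
to_percent(total = 500, part = 250)    # 50
# Omitting `total` falls back to the default of 100
to_percent(25)                         # 25
```

Naming the arguments in the call makes the code easier to read, especially once a function has more than two of them.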
Fortunately, you do not need to write everything from scratch. R has lots of built-in functions that you can use:
round(54.6787)
## [1] 55
round(54.5787, digits = 2)
## [1] 54.58
Use ? before the function name to get some help. For example, ?round. You will see many functions in the rest of the workshop.
The function class() shows the type of a variable.
TRUE and FALSE can be abbreviated as T and F. They have to be capitalized; true is not a logical value:
class(TRUE)
## [1] "logical"
class(F)
## [1] "logical"
class(2)
## [1] "numeric"
class(13.46)
## [1] "numeric"
class("ha ha ha ha")
## [1] "character"
class("56.6")
## [1] "character"
class("TRUE")
## [1] "character"
Can we change the type of data in a variable? Yes, by using the as.*() family of functions, such as as.numeric() and as.character():
as.numeric(TRUE)
## [1] 1
as.character(4)
## [1] "4"
as.numeric("4.5")
## [1] 4.5
as.numeric("Hello")
## Warning: NAs introduced by coercion
## [1] NA
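When a conversion fails, the result is NA rather than an error, so it can be useful to check for missing values afterwards. The sketch below uses a made-up vector raw_scores to show one possible way of doing that.

```r
# Hypothetical input: numbers stored as text, with one invalid entry
raw_scores <- c("4.5", "12", "Hello")
# Convert to numeric; suppressWarnings() hides the expected coercion warning
scores <- suppressWarnings(as.numeric(raw_scores))
# is.na() marks the entries that could not be converted
is.na(scores)                  # FALSE FALSE TRUE
# na.rm = TRUE tells mean() to ignore the missing value
mean(scores, na.rm = TRUE)     # 8.25
```

Checking is.na() right after a conversion helps to catch bad entries before they silently affect later calculations.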
### Vector
A vector stores more than one number or character string. Use the combine function c() to create one:
sale <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # also sale <- c(1:10)
sale <- c(1:10)
sale * sale
## [1] 1 4 9 16 25 36 49 64 81 100
Subsetting a vector:
days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
days[2]
## [1] "Sunday"
days[-2]
## [1] "Saturday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
days[c(2, 3, 4)]
## [1] "Sunday" "Monday" "Tuesday"
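Besides positions, a vector can also be subset with a logical condition, which is often handy when the positions are not known in advance. This is a small sketch that reuses the sale and days vectors defined above; the object names on the left-hand side are illustrative.

```r
sale <- c(1:10)
days <- c("Saturday", "Sunday", "Monday", "Tuesday",
          "Wednesday", "Thursday", "Friday")

# Keep only the elements for which the condition is TRUE
large_sales <- sale[sale > 5]            # 6 7 8 9 10
not_sunday <- days[days != "Sunday"]     # all days except Sunday
print(large_sales)
print(not_sunday)
# length() returns the number of elements in a vector
length(not_sunday)                       # 6
```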
Create a vector named my_vector with the numbers from 0 to 1000 in it and calculate the mean, median, sd, min, max, and sum of that vector:
my_vector <- (0:1000)
mean(my_vector)
## [1] 500
median(my_vector)
## [1] 500
min(my_vector)
## [1] 0
range(my_vector)
## [1] 0 1000
class(my_vector)
## [1] "integer"
sum(my_vector)
## [1] 500500
sd(my_vector)
## [1] 289.1081
A list allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, or even other lists.
my_list = list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)
my_list
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4 5 6 7
##
## [[5]]
## [1] "HELLO"
##
## [[6]]
## [1] "hello"
##
## [[7]]
## [1] FALSE
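To pull elements out of a list, double brackets return the element itself while single brackets return a smaller list. The sketch below rebuilds my_list from above; named_list and its element names are hypothetical additions for illustration.

```r
sale <- c(1:10)
my_list <- list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)

# Double brackets extract the element itself (here an integer vector)
my_list[[4]]          # 4 5 6 7
# Single brackets return a list of length one that contains the element
class(my_list[4])     # "list"
# Extraction can be chained: second value of the fourth element
my_list[[4]][2]       # 5

# Elements can also be named and then accessed with $
named_list <- list(numbers = sale, greeting = "hello")
named_list$greeting   # "hello"
```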
Factors store a vector together with the distinct values of its elements as labels (levels). The labels are always character strings, regardless of whether the original values are numeric or character. For example, a variable gender with “male” and “female” entries:
gender <- c("male", "male", "male", " female", "female", "female")
gender <- factor(gender)
R now treats gender as a nominal (categorical) variable: 1=female, 2=male internally (alphabetically).
summary(gender)
## female female male
## 1 2 3
gender
## [1] male male male female female female
## Levels: female female male
So, be careful of spaces! The entry " female" with a leading space is treated as a separate level.
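One possible way to avoid the extra level is to strip the surrounding whitespace before creating the factor. The sketch below uses trimws() from base R; the name gender_clean is only illustrative.

```r
gender <- c("male", "male", "male", " female", "female", "female")
# trimws() removes leading and trailing whitespace from each string
gender_clean <- factor(trimws(gender))
levels(gender_clean)        # "female" "male"
summary(gender_clean)       # 3 female, 3 male
# The internal integer codes follow the alphabetical order of the levels
as.integer(gender_clean)    # 2 2 2 1 1 1
```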
Create a gender factor with 30 male and 40 female entries (using the rep() function):
gender <- c(rep("male", 30), rep("female", 40))
gender <- factor(gender)
gender
## [1] male male male male male male male male male male
## [11] male male male male male male male male male male
## [21] male male male male male male male male male male
## [31] female female female female female female female female female female
## [41] female female female female female female female female female female
## [51] female female female female female female female female female female
## [61] female female female female female female female female female female
## Levels: female male
There are two types of categorical variables: nominal and ordinal. How do we create ordered factors (when the variable is ordinal, so its values can be ordered)? We add two additional arguments to the factor() function: ordered = TRUE and levels = c("level1", "level2"). For example, we have a vector that shows participants’ education level:
edu <- c(3, 2, 3, 4, 1, 2, 2, 3, 4)
education <- factor(edu, ordered = TRUE)
levels(education) <- c("Primary school", "high school", "College", "Uni graduated")
education
## [1] College high school College Uni graduated Primary school
## [6] high school high school College Uni graduated
## Levels: Primary school < high school < College < Uni graduated
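The same ordered factor can also be built in a single factor() call by passing levels, labels, and ordered together. This is a sketch of that alternative, and it also shows that ordered factors can be compared with operators such as >=.

```r
edu <- c(3, 2, 3, 4, 1, 2, 2, 3, 4)
# levels gives the raw values in order; labels gives their display names
education <- factor(edu,
                    levels = 1:4,
                    labels = c("Primary school", "high school",
                               "College", "Uni graduated"),
                    ordered = TRUE)
# Comparisons respect the order of the levels
education >= "College"
# Number of participants with at least a College education
sum(education >= "College")    # 5
# The highest education level in the data
max(education)                 # Uni graduated
```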
The factor below has patient and control values. Here, the first level is control and the second level is patient. Change the order of levels so that patient is the first level:
health_status <- factor(c(rep('patient', 5), rep('control', 5)))
health_status
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: control patient
health_status_reordered <- factor(health_status, levels = c('patient','control'))
health_status_reordered
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: patient control
Finally, can you relabel both levels so that they start with an uppercase letter? (Hint: check ?factor)
health_status_relabeled <- factor(health_status, levels = c('patient','control'), labels = c('Patient','Control'))
health_status_relabeled
## [1] Patient Patient Patient Patient Patient Control Control Control Control
## [10] Control
## Levels: Patient Control
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. A matrix can be created by passing a vector to the matrix() function:
my_matrix = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, ncol = 3)
my_matrix
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
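As the output shows, matrix() fills the values column by column by default. The sketch below shows how elements can be selected with [row, column] and how byrow = TRUE changes the filling order; the name by_row is only illustrative.

```r
my_matrix <- matrix(1:9, nrow = 3, ncol = 3)
# Element in row 2, column 3
my_matrix[2, 3]      # 8
# Leaving the row index empty selects the whole column
my_matrix[ , 1]      # 1 2 3
# dim() returns the number of rows and columns
dim(my_matrix)       # 3 3

# byrow = TRUE fills the matrix row by row instead
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
by_row[2, 3]         # 6
```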
Data frames can hold numeric, character, or logical values. Within a column all elements have the same data type, but different columns can have different data types. Let’s create a data frame:
id <- 1:200
group <- c(rep("Psychotherapy", 100), rep("Medication", 100))
response <- c(rnorm(100, mean = 30, sd = 5),
              rnorm(100, mean = 25, sd = 5))
my_dataframe <- data.frame(Patient = id,
                           Treatment = group,
                           Response = response)
We could also have created the same data frame in a single call:
my_dataframe <- data.frame(Patient = c(1:200),
                           Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
                           Response = c(rnorm(100, mean = 30, sd = 5),
                                        rnorm(100, mean = 25, sd = 5)))
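Because rnorm() draws random numbers, the Response values will differ every time the code is run. If reproducible values are needed, one option is to call set.seed() before generating the data. The seed value 123 and the object names below are arbitrary choices for this sketch.

```r
# Setting the same seed before each draw gives the same random numbers
set.seed(123)
first_draw <- rnorm(5, mean = 30, sd = 5)
set.seed(123)
second_draw <- rnorm(5, mean = 30, sd = 5)
identical(first_draw, second_draw)    # TRUE
```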
In large data sets, the function head() shows the first observations of a data frame. Similarly, the function tail() prints the last observations in your data set.
head(my_dataframe)
| | Patient | Treatment | Response |
|---|---|---|---|
| 1 | 1 | Psychotherapy | 34.96260 |
| 2 | 2 | Psychotherapy | 31.64657 |
| 3 | 3 | Psychotherapy | 31.42481 |
| 4 | 4 | Psychotherapy | 25.43197 |
| 5 | 5 | Psychotherapy | 30.61577 |
| 6 | 6 | Psychotherapy | 24.99077 |
tail(my_dataframe)
| | Patient | Treatment | Response |
|---|---|---|---|
| 195 | 195 | Medication | 27.30616 |
| 196 | 196 | Medication | 23.06917 |
| 197 | 197 | Medication | 27.53152 |
| 198 | 198 | Medication | 19.11493 |
| 199 | 199 | Medication | 16.60061 |
| 200 | 200 | Medication | 29.49956 |
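Both functions show six rows by default, and the n argument changes that number. The sketch below rebuilds a data frame like the one above (with an arbitrary seed, so the values will not match the tables) and also uses dim() and str() for a quick overview.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
my_dataframe <- data.frame(Patient = 1:200,
                           Treatment = c(rep("Psychotherapy", 100),
                                         rep("Medication", 100)),
                           Response = c(rnorm(100, mean = 30, sd = 5),
                                        rnorm(100, mean = 25, sd = 5)))
# Show only the first and last three rows
head(my_dataframe, n = 3)
tail(my_dataframe, n = 3)
# Number of rows and columns
dim(my_dataframe)     # 200 3
# Compact overview of column names and types
str(my_dataframe)
```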
Similar to vectors and matrices, brackets [] are used to select data from rows and columns in data frames:
my_dataframe[35, 3]
## [1] 25.46674
my_dataframe[1:10, ]
| Patient | Treatment | Response |
|---|---|---|
| 1 | Psychotherapy | 34.96260 |
| 2 | Psychotherapy | 31.64657 |
| 3 | Psychotherapy | 31.42481 |
| 4 | Psychotherapy | 25.43197 |
| 5 | Psychotherapy | 30.61577 |
| 6 | Psychotherapy | 24.99077 |
| 7 | Psychotherapy | 24.38427 |
| 8 | Psychotherapy | 25.79177 |
| 9 | Psychotherapy | 38.67245 |
| 10 | Psychotherapy | 29.08975 |
How to get only the Response column for all participants?
my_dataframe[ , 3]
## [1] 34.96260 31.64657 31.42481 25.43197 30.61577 24.99077 24.38427 25.79177
## [9] 38.67245 29.08975 26.50346 32.45890 25.38949 29.21747 32.84061 38.82994
## [17] 29.85224 29.54578 30.84349 30.83514 18.97348 28.97147 38.96407 37.99054
## [25] 29.02186 25.08161 33.75379 28.48740 30.57279 25.52159 30.03123 23.41081
## [33] 28.12926 22.78142 25.46674 36.18784 33.01753 27.74373 29.74098 30.20330
## [41] 29.92995 32.94010 40.79224 36.81551 27.13167 27.63639 26.61317 30.40545
## [49] 31.03785 34.30708 23.70042 32.45592 31.49774 31.74705 28.77679 34.88093
## [57] 26.54407 26.61226 35.94736 29.92079 29.61376 38.78326 26.69423 24.48900
## [65] 27.64699 26.76344 27.67637 23.96007 31.19174 23.39417 35.35022 27.59655
## [73] 30.83045 34.74902 33.38851 35.21690 39.41601 31.79024 37.49004 21.51096
## [81] 21.89162 20.60157 36.45890 29.56125 28.96662 30.09829 25.55787 26.78051
## [89] 30.89210 26.76469 29.76614 28.95641 31.50727 43.91029 28.75246 24.85764
## [97] 29.84467 29.89428 26.33818 34.06545 25.56559 16.97324 21.13960 26.02983
## [105] 24.82091 16.76352 21.88050 24.02214 22.90504 26.95942 22.01723 19.48741
## [113] 28.81998 22.95709 31.45071 31.36137 22.83711 18.04427 24.49852 22.97583
## [121] 16.35870 19.90071 35.51241 30.02796 22.32228 22.57951 25.18171 27.55821
## [129] 30.33464 21.47439 21.22207 28.31029 27.43622 27.16737 23.91594 20.91610
## [137] 29.50764 16.38556 24.00856 23.30033 24.19752 25.03772 23.84564 15.86221
## [145] 18.88769 29.20390 22.57849 21.08188 21.75392 16.57595 20.59390 26.39006
## [153] 24.39367 26.56404 27.32385 25.49716 28.43724 20.55394 31.22679 19.45794
## [161] 13.23348 32.51918 26.37662 23.44986 29.58656 34.81423 24.71990 21.96679
## [169] 21.53439 32.79687 31.27792 21.36931 29.99414 25.87782 23.80425 19.44767
## [177] 26.15893 21.80083 20.95385 30.22909 24.62255 22.27265 29.65765 11.75282
## [185] 24.66539 23.07453 25.09461 26.28304 20.69232 26.80548 24.25434 16.25224
## [193] 23.13111 22.90094 27.30616 23.06917 27.53152 19.11493 16.60061 29.49956
An easier way to select particular columns is to use their names, which is more convenient than using index numbers in large data sets:
my_dataframe[ , "Response"]
# OR:
my_dataframe$Response
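Column names can also be combined with a logical condition in the row position to select only some participants. This is a sketch under the assumption that the data frame has the same structure as above; the seed and the names medication_response and high_responders are illustrative.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
my_dataframe <- data.frame(Patient = 1:200,
                           Treatment = c(rep("Psychotherapy", 100),
                                         rep("Medication", 100)),
                           Response = c(rnorm(100, mean = 30, sd = 5),
                                        rnorm(100, mean = 25, sd = 5)))

# Response values of the Medication group only
medication_response <- my_dataframe[my_dataframe$Treatment == "Medication", "Response"]
length(medication_response)    # 100
mean(medication_response)

# All columns for participants with a Response above 30
high_responders <- my_dataframe[my_dataframe$Response > 30, ]
head(high_responders)
```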
So far, we have created data frames using the data.frame() function from base R. However, a better way to create data frames is to use the tibble() function from the tidyverse.
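A minimal sketch of the tibble version is shown below. It assumes the tibble package (part of the tidyverse) is installed, so the code first checks for it; the object name my_tibble is only illustrative.

```r
# Only run the example if the tibble package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  my_tibble <- tibble::tibble(
    Patient = 1:200,
    Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
    Response = c(rnorm(100, mean = 30, sd = 5),
                 rnorm(100, mean = 25, sd = 5))
  )
  # Printing a tibble shows only the first rows plus the column types
  print(my_tibble)
}
```

Unlike a base data frame, a tibble does not print all 200 rows to the console, so head() is often not needed for a quick look.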